NPU Enabling and Usage

From RidgeRun Developer Wiki



Introduction

The Dragonwing IQ-9075 EVK has a dedicated NPU that delivers up to 100 dense TOPS and can run 13-billion-parameter models at 12 tokens per second.

On Yocto, the recipes-ml layer provides recipes for some Qualcomm AI Runtime SDK components. On Ubuntu, the SDK is only partially included, which is enough to run some sample applications and GStreamer pipelines.

Specification | Value
NPU name | Qualcomm Hexagon[1]
NPU architecture | Dual Hexagon Tensor Processors[2]
Compute extensions | Vector and matrix extensions[3]
Vector accelerator | Quad Qualcomm Hexagon Vector eXtensions (HVX)[4]
Matrix accelerator | Dual Qualcomm Hexagon Matrix eXtensions (HMX) coprocessors[4]
Integrated DSP | Qualcomm Hexagon DSP[4]
Peak NPU performance, QCS9075-AC | Up to 50 dense TOPS[1]
Peak NPU performance, QCS9075-AA | Up to 100 dense TOPS[1]
Peak sparse-equivalent performance | Up to 200 equivalent sparse TOPS[2]
INT8 AI performance | Up to 100 INT8 TOPS[3]
Example generative AI workload | Llama 2 7B at up to 22 tokens/s[1]
NPU software backend | QNN HTP backend / Hexagon Tensor Processor backend[5]
Quantized network support | Quantized 8-bit and quantized 16-bit networks[5]
Floating-point support | Float32 networks using float16 math on select Qualcomm SoCs[5]
Operator / layer support source | QAIRT / QNN Supported Operations, HTP backend columns[6]

The full QAIRT SDK supports the following operations:

Supported layers | Layer type | Datatype | Backend
Conv1d, Conv2d, Conv3d, DepthWiseConv1d, DepthWiseConv2d | Convolution | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
FullyConnected, MatMul | Dense / matrix multiplication | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
PoolAvg2d, PoolAvg3d, PoolMax2d, PoolMax3d, L2Pool2d | Pooling | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
Relu, Prelu, Elu, Gelu, HardSwish, Sigmoid, Tanh, Softplus | Activation | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
ElementWiseAdd, ElementWiseSubtract, ElementWiseMultiply, ElementWiseDivide, ElementWisePower, ElementWiseMaximum, ElementWiseMinimum, ElementWiseSquaredDifference | Element-wise arithmetic | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
ElementWiseAbs, ElementWiseCeil, ElementWiseCos, ElementWiseExp, ElementWiseFloor, ElementWiseLog, ElementWiseNeg, ElementWiseRound, ElementWiseRsqrt, ElementWiseSin, ElementWiseSquareRoot | Element-wise unary | FP32 / FP16 | CPU, HTP, HTP FP16, GPU, LPAI
ElementWiseEqual, ElementWiseNotEqual, ElementWiseGreater, ElementWiseGreaterEqual, ElementWiseLess, ElementWiseLessEqual | Element-wise comparison | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
ElementWiseAnd, ElementWiseOr, ElementWiseXor, ElementWiseNot | Element-wise logical | Boolean / integer | CPU, HTP, HTP FP16, GPU, LPAI
Batchnorm, InstanceNorm, LayerNorm, GroupNorm, L2Norm, Lrn | Normalization | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
ReduceMax, ReduceMean, ReduceMin, ReduceProd, ReduceSum, ReduceSumSquare | Reduction | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
Softmax, LogSoftmax, MaskedSoftmax | Softmax | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
Reshape, Squeeze, ExpandDims, Transpose, Permute, Pack, Unpack, Concat, Split, Slice | Tensor shape / layout | Input datatype | CPU, HTP, HTP FP16, GPU, LPAI
Pad, Tile, Gather, GatherElements, GatherNd, OneHot, NonZero | Tensor indexing / construction | FP32 / FP16 / integer / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
Quantize, Dequantize, Cast, Convert | Datatype conversion | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
Resize, CropAndResize, GridSample, ExtractPatches, DepthToSpace, SpaceToDepth, BatchToSpace | Vision / spatial transform | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
Argmax, Argmin, TopK | Selection / ranking | FP32 / FP16 / integer / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
Lstm, Gru | Recurrent | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI
NonMaxSuppression, MultiClassNms, CombinedNms, BoxWithNmsLimit, DetectionOutput, GenerateProposals, CollectRpnProposals, DistributeFpnProposals, BboxTransform | Detection / proposal generation | FP32 / FP16 / INT8 / INT16 | CPU, HTP, HTP FP16, GPU, LPAI

GStreamer elements for NPU use

GStreamer elements typically used for AI applications:

Element | Type | Description
qtimlqnn | QNN inference | Qualcomm QNN-based inference element. This is the most direct GStreamer candidate for QNN/HTP/NPU execution; set the backend from the default CPU library to the HTP backend when validating NPU.
qtimltflite | TensorFlow Lite inference | Runs TFLite models. For NPU testing, use delegate=external with the QNN TFLite delegate and HTP backend options.
qtimlsnpe | SNPE inference | Runs SNPE/DLC models with delegate options for CPU, DSP, GPU, or AIP. Useful for legacy Qualcomm AI pipelines, but QNN/TFLite paths are more relevant for HTP/NPU validation.
qtimlvconverter | Video-to-tensor preprocessing | Converts video/x-raw frames into neural-network/tensors before inference. Used before qtimlqnn, qtimltflite, or qtimlsnpe in video AI pipelines.
qtimlaconverter | Audio-to-tensor preprocessing | Converts mono raw audio into neural-network/tensors. Supports raw, spectrogram, MFE, LMFE, and MFCC features for audio ML pipelines.
qtimlpostprocess | Generic ML postprocessor | Preferred post-processing element for converting inference tensors into video, text, or tensor outputs. Supports modules for detection, classification, segmentation, pose, OCR, depth, face, audio, and related AI tasks.
qtimldemux | Tensor demuxer | Splits batched neural-network/tensors into separate tensor streams. Useful after batched inference or when separating multiple tensor outputs.
qtibatch | Batch muxer | Batches buffers from multiple streams into one output buffer. Useful when testing batched or multi-stream inference.
qtimlmetaextractor | ML metadata extractor | Extracts ML metadata from video buffers into UTF-8 text buffers for logging, debugging, or publishing inference results.
qtimlmetaparser | ML metadata parser | Parses ML metadata from video or text buffers. Useful when converting inference metadata into a structured text representation such as JSON.
qtimetamux | Metadata muxer | Attaches text or optical-flow metadata as GstMeta to raw video/audio buffers, allowing inference results to travel with the media stream.
qtimetatransform | Metadata transform | Filters or transforms metadata attached to video buffers. Useful for ROI-based AI flows or smoothing label/ROI metadata.
qtiobjtracker | Object tracker | Tracks detected objects across frames after detection post-processing. Useful after object detection inference to maintain object IDs over time.
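
Before building a pipeline, it can help to confirm which of these elements are actually installed on the board. The following is a minimal sketch that relies on the standard gst-inspect-1.0 --exists check. The function name check_elements and the INSPECT override variable are illustrative assumptions, not part of the Qualcomm SDK.

```shell
# Report which GStreamer elements are installed on this system.
# INSPECT is a hypothetical override so the lookup tool can be swapped out;
# by default the real gst-inspect-1.0 binary is used.
INSPECT=${INSPECT:-gst-inspect-1.0}

check_elements() {
    missing=0
    for element in "$@"; do
        # --exists returns 0 when the element is registered, non-zero otherwise.
        if "$INSPECT" --exists "$element" >/dev/null 2>&1; then
            echo "found:   $element"
        else
            echo "missing: $element"
            missing=$((missing + 1))
        fi
    done
    # The function fails when at least one element is missing.
    [ "$missing" -eq 0 ]
}
```

On the board, a call such as check_elements qtimlvconverter qtimltflite qtimlqnn qtimlpostprocess lists each element as found or missing and returns a non-zero status if any is absent.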

Running AI sample applications [7]

Qualcomm provides AI sample applications for object detection and parallel inferencing on the Dragonwing IQ-9075 device. They accept input from a camera, a video file, or an RTSP stream. To run an application, use the following workflow:

  1. Download models and labels
  2. Transfer the downloaded files to the device
  3. Run AI sample applications

Download and transfer AI models and labels

The required models can be downloaded from Qualcomm AI Hub. These are the required files for some example applications:

Sample application | Models required
AI object detection | yolox_quantized.tflite
Parallel AI inference | yolox_quantized.tflite, Inception-v3, HRNetPose, DeepLabV3-Plus-MobileNet
Multistream inference | yolox_quantized.tflite, Inception-v3

To download the files with the automated script, first create a working directory on the board:

WORKING_DIR=~/AI_Examples
mkdir -p "$WORKING_DIR"
cd "$WORKING_DIR"
sudo apt install unzip

Get the script:

curl -L -O https://raw.githubusercontent.com/quic/sample-apps-for-qualcomm-linux/refs/heads/main/qualcomm-linux/scripts/download_artifacts.sh

Give executable permission:

chmod +x download_artifacts.sh

Execute the script:

sudo ./download_artifacts.sh
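
After the script finishes, it is worth confirming that the files referenced by the pipelines are in place. The following is a minimal sketch, assuming the script placed the model and labels at the paths used later on this page (/etc/models/yolox_quantized.tflite and /etc/labels/yolox.json). The helper name require_files is hypothetical.

```shell
# Return non-zero if any of the given files is missing or empty.
require_files() {
    rc=0
    for path in "$@"; do
        # -s is true only for files that exist and have a size greater than zero.
        if [ -s "$path" ]; then
            echo "ok:      $path"
        else
            echo "missing: $path" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

On the board this could be called as require_files /etc/models/yolox_quantized.tflite /etc/labels/yolox.json before launching any pipeline.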

gst-ai-object-detection application

To set up the application, edit the configuration file created at /etc/configs/config_detection.json:

sudo vim /etc/configs/config_detection.json

To run with the example video as the source, change the file as follows:

{
  "file-path": "/etc/media/video.mp4",
  "ml-framework": "tflite",
  "yolo-model-type": "yolox",
  "model": "/etc/models/yolox_quantized.tflite",
  "labels": "/etc/labels/yolox.json",
  "threshold": 40,
  "runtime": "dsp",
  "output-type": "waylandsink"
}

To run with camera source:

{
  "camera": 0,
  "ml-framework": "tflite",
  "yolo-model-type": "yolox",
  "model": "/etc/models/yolox_quantized.tflite",
  "labels": "/etc/labels/yolox.json",
  "threshold": 40,
  "runtime": "dsp",
  "output-type": "waylandsink"
}
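
The two configurations above differ only in the input-source field, so the file can also be generated instead of edited by hand. The following is a minimal sketch; the function name write_detection_config and its file/camera arguments are hypothetical, and all other field values are copied from the examples above.

```shell
# Write a config_detection.json for either a video file or a camera source.
# Usage: write_detection_config file /etc/media/video.mp4 out.json
#        write_detection_config camera 0 out.json
write_detection_config() {
    kind=$1; value=$2; out=$3
    case $kind in
        file)   source_line="\"file-path\": \"$value\"" ;;
        camera) source_line="\"camera\": $value" ;;
        *)      echo "unknown source kind: $kind" >&2; return 1 ;;
    esac
    # The remaining fields match the example configurations on this page.
    cat > "$out" <<EOF
{
  $source_line,
  "ml-framework": "tflite",
  "yolo-model-type": "yolox",
  "model": "/etc/models/yolox_quantized.tflite",
  "labels": "/etc/labels/yolox.json",
  "threshold": 40,
  "runtime": "dsp",
  "output-type": "waylandsink"
}
EOF
}
```

Because /etc/configs normally requires root access, one option is to write to a temporary file first and then copy it into place with sudo cp.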

The following table lists and describes the fields in the config_detection.json file.

Field Values/description
ml-framework

Use one of the following frameworks:

  • snpe: Qualcomm® Neural Processing SDK
  • tflite: LiteRT
  • qnn: Qualcomm® AI Engine Direct
yolo-model-type Selects the YOLO model type to run: yolov5, yolov8, yolox, or yolonas. For more information about models and labels, see the Sample model and label files.
runtime

Use one of the following runtimes:

  • cpu
  • gpu
  • dsp
Input source

Use one of the following input sources:

  • camera: Primary camera 0 or secondary camera 1
  • file-path: Directory path of the video file
  • rtsp-ip-port: Address of the RTSP stream in the rtsp://<ip>:<port>/<stream> format
  • enable-usb-camera: TRUE or FALSE
output-ip-address Output server IP address
port Output server port
output-type

Use one of the following output types:

  • waylandsink: To display the output on Wayland
  • filesink: To store the output in a file
  • rtspsink: To stream the output on the server
USB camera video-format and resolution

Use one of the following video formats:

  • nv12
  • yuy2
  • mjpeg

Use one of the following resolution fields:

  • width: Input USB camera source resolution width
  • height: Input USB camera source resolution height
  • framerate: Input USB camera source framerate
output-file Output filename. The default filename is output_detection.mp4.

Run the app

gst-ai-object-detection

Object Detection with gst-launch-1.0

Object detection using camera and udp sink

The download_artifacts.sh script downloads the models to /etc/models/.

On board:

HOST_IP=X.X.X.X
PORT=5000
gst-launch-1.0 -e -v \
  qtiqmmfsrc camera=0 name=camsrc \
  camsrc. ! queue ! \
    'video/x-raw,format=NV12_Q08C,width=1280,height=720,framerate=30/1' ! \
    qtivcomposer name=mixer \
      sink_0::dimensions="<1280,720>" \
      sink_0::position="<0,0>" \
      sink_0::zorder=0 ! \
    queue ! \
    'video/x-raw,format=NV12,width=1280,height=720,framerate=30/1' ! \
    v4l2h264enc \
      capture-io-mode=dmabuf \
      output-io-mode=dmabuf-import ! \
    h264parse config-interval=1 ! \
    rtph264pay pt=96 config-interval=1 ! \
    udpsink \
      host=$HOST_IP \
      port=$PORT \
      sync=false \
      async=false \
  camsrc. ! queue ! \
    'video/x-raw,format=NV12,width=640,height=360,framerate=30/1' ! \
    qtimlvconverter ! \
    qtimltflite \
      model=/etc/models/yolox_quantized.tflite \
      delegate=external \
      external-delegate-path=libQnnTFLiteDelegate.so \
      external-delegate-options="QNNExternalDelegate,backend_type=htp" ! \
    qtimlpostprocess \
      module=yolov8 \
      labels=/etc/labels/yolox.json \
      results=10 \
      settings='{"confidence": 40.0}' ! \
    'video/x-raw,format=BGRA,width=640,height=360' ! \
    queue ! \
    mixer.

On Host PC:

PORT=5000
gst-launch-1.0 -v \
  udpsrc port=$PORT caps='application/x-rtp,media=video,encoding-name=H264,payload=96,clock-rate=90000' ! \
  rtph264depay ! \
  h264parse ! \
  avdec_h264 ! \
  videoconvert ! \
  autovideosink sync=false

Object detection using filesrc and udp sink

On board:

HOST_IP=X.X.X.X
PORT=5000
gst-launch-1.0 -e -v \
  filesrc location=/etc/media/video.mp4 ! \
    qtdemux ! \
    h264parse ! \
    decodebin ! \
    identity sync=true ! \
    queue ! \
    qtivtransform ! \
    'video/x-raw,format=NV12,width=1280,height=720,framerate=30/1' ! \
    tee name=split \
  split. ! queue ! \
    qtivcomposer name=mixer \
      sink_0::dimensions="<1280,720>" \
      sink_0::position="<0,0>" \
      sink_0::zorder=0 ! \
    queue ! \
    'video/x-raw,format=NV12,width=1280,height=720,framerate=30/1' ! \
    v4l2h264enc \
      capture-io-mode=dmabuf \
      output-io-mode=dmabuf-import ! \
    h264parse config-interval=1 ! \
    rtph264pay pt=96 config-interval=1 ! \
    udpsink \
      host=$HOST_IP \
      port=$PORT \
      sync=false \
      async=false \
  split. ! queue ! \
    qtivtransform ! \
    'video/x-raw,format=NV12,width=640,height=360,framerate=30/1' ! \
    qtimlvconverter ! \
    qtimltflite \
      model=/etc/models/yolox_quantized.tflite \
      delegate=external \
      external-delegate-path=libQnnTFLiteDelegate.so \
      external-delegate-options="QNNExternalDelegate,backend_type=htp" ! \
    qtimlpostprocess \
      module=yolov8 \
      labels=/etc/labels/yolox.json \
      results=10 \
      settings='{"confidence": 40.0}' ! \
    'video/x-raw,format=BGRA,width=640,height=360' ! \
    queue ! \
    mixer.

On Host PC:

PORT=5000
gst-launch-1.0 -v \
  udpsrc port=$PORT caps='application/x-rtp,media=video,encoding-name=H264,payload=96,clock-rate=90000' ! \
  rtph264depay ! \
  h264parse ! \
  avdec_h264 ! \
  videoconvert ! \
  autovideosink sync=false

Object detection using camera and filesink

OUT=ai_object_detection_video.mp4
gst-launch-1.0 -e -v \
  qtiqmmfsrc camera=1 name=camsrc \
  camsrc. ! queue ! \
    'video/x-raw,format=NV12_Q08C,width=1280,height=720,framerate=30/1' ! \
    qtivcomposer name=mixer \
      sink_0::dimensions="<1280,720>" \
      sink_0::position="<0,0>" \
      sink_0::zorder=0 ! \
    queue ! \
    'video/x-raw,format=NV12,width=1280,height=720,framerate=30/1' ! \
    v4l2h264enc \
      capture-io-mode=dmabuf \
      output-io-mode=dmabuf-import ! \
    h264parse config-interval=1 ! \
    mp4mux ! \
    filesink location=$OUT \
  camsrc. ! queue ! \
    'video/x-raw,format=NV12,width=640,height=360,framerate=30/1' ! \
    qtimlvconverter ! \
    qtimltflite \
      model=/etc/models/yolox_quantized.tflite \
      delegate=external \
      external-delegate-path=libQnnTFLiteDelegate.so \
      external-delegate-options="QNNExternalDelegate,backend_type=htp" ! \
    qtimlpostprocess \
      module=yolov8 \
      labels=/etc/labels/yolox.json \
      results=10 \
      settings='{"confidence": 40.0}' ! \
    'video/x-raw,format=BGRA,width=640,height=360' ! \
    queue ! \
    mixer.

Checking NPU use

You can debug the GStreamer pipeline further by setting:

export GST_DEBUG="qtimltflite:6,qtimlvconverter:4,qtimlpostprocess:4,*qnn*:6,*Qnn*:6"
export TFLITE_MINIMAL_LOG_LEVEL=0
export ADSP_LIBRARY_PATH="/usr/lib/rfsa/adsp:/usr/lib:/usr/lib/aarch64-linux-gnu"

To confirm that the NPU is being used as the backend, run the same pipeline with each backend (HTP, CPU, and GPU) and compare the results.

Set the runtime paths:

export LD_LIBRARY_PATH=/usr/lib:$LD_LIBRARY_PATH
export ADSP_LIBRARY_PATH=/usr/lib/rfsa/adsp:/usr/lib/rfsa/adsp/hexagon-v73:/usr/lib
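
The three benchmark pipelines in the following sections differ only in the properties passed to qtimltflite. As a sketch of that difference, the helper below prints the property set for each backend. The function name tflite_backend_args is hypothetical; the property values are the ones used in the benchmark pipelines on this page.

```shell
# Print the qtimltflite properties used on this page for a given backend.
tflite_backend_args() {
    case $1 in
        htp)
            # DELEGATE falls back to the delegate path used in the HTP benchmark.
            printf '%s\n' "delegate=external external-delegate-path=${DELEGATE:-/usr/lib/libQnnTFLiteDelegate.so} external-delegate-options=QNNExternalDelegate,backend_type=htp"
            ;;
        cpu) printf '%s\n' "delegate=none threads=4" ;;
        gpu) printf '%s\n' "delegate=gpu" ;;
        *)   echo "unknown backend: $1 (expected htp, cpu or gpu)" >&2; return 1 ;;
    esac
}

# Intended use inside a pipeline (word splitting of the output is deliberate):
#   gst-launch-1.0 ... ! qtimltflite model=$MODEL $(tflite_backend_args htp) ! ...
```

Keeping the rest of the pipeline identical and changing only this argument makes the FPS and CPU figures for the three runs directly comparable.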

HTP/NPU benchmark

Run this pipeline:

MODEL=/etc/models/yolox_quantized.tflite
LABELS=/etc/labels/yolox.json
DELEGATE=/usr/lib/libQnnTFLiteDelegate.so

GST_DEBUG=2 gst-launch-1.0 -e -v \
  qtiqmmfsrc camera=1 ! \
  'video/x-raw,format=NV12,width=1280,height=720,framerate=30/1' ! \
  qtimlvconverter ! \
  qtimltflite \
    model=$MODEL \
    delegate=external \
    external-delegate-path=$DELEGATE \
    external-delegate-options="QNNExternalDelegate,backend_type=htp" ! \
  qtimlpostprocess \
    module=yolov8 \
    labels=$LABELS \
    results=10 \
    settings='{"confidence": 40.0}' ! \
  fpsdisplaysink video-sink=fakesink text-overlay=false sync=false

This results in the following log after 10 seconds:

...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 234, dropped: 0, current: 30.18, average: 30.03
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 249, dropped: 0, current: 29.76, average: 30.01
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 265, dropped: 0, current: 30.17, average: 30.02
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 280, dropped: 0, current: 29.96, average: 30.02
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 295, dropped: 0, current: 29.91, average: 30.02
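
To compare runs without reading the log by eye, the average can be extracted from the last fpsdisplaysink message. The following is a minimal sketch, assuming the gst-launch-1.0 output was saved to a file; the function name last_average_fps and the log file name are hypothetical.

```shell
# Print the average FPS from the last fpsdisplaysink message read on stdin.
last_average_fps() {
    awk '/last-message = rendered:/ {
             # The value follows the "average:" token on each message line.
             for (i = 1; i < NF; i++) {
                 if ($i == "average:") {
                     avg = $(i + 1)
                     gsub(/[^0-9.]/, "", avg)   # drop any trailing punctuation
                 }
             }
         }
         END { if (avg != "") print avg; else exit 1 }'
}

# Example with two of the log lines shown above:
printf '%s\n' \
    '/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 280, dropped: 0, current: 29.96, average: 30.02' \
    '/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 295, dropped: 0, current: 29.91, average: 30.02' |
    last_average_fps   # prints 30.02
```

With a saved log, the call would be last_average_fps < run.log; the function returns a non-zero status if no fpsdisplaysink message is found.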

CPU usage reported by top:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2800 ubuntu    20   0 2210588 281000 137748 S   7.9   0.8   0:03.47 gst-launch-1.0

Information when adding a perf element before fpsdisplaysink:

perf: perf0; timestamp: 0:04:01.269363368; bps: 73728000.000; mean_bps: 73728000.000; fps: 29.967; mean_fps: 29.739; cpu: 13;

Information when using watch -n 1 cat /sys/class/kgsl/kgsl-3d0/gpubusy:

4610 1000644

CPU baseline benchmark

Run this pipeline:

MODEL=/etc/models/yolox_quantized.tflite
LABELS=/etc/labels/yolox.json

GST_DEBUG=2 gst-launch-1.0 -e -v \
  qtiqmmfsrc camera=1 ! \
  'video/x-raw,format=NV12,width=1280,height=720,framerate=30/1' ! \
  qtimlvconverter ! \
  qtimltflite \
    model=$MODEL \
    delegate=none \
    threads=4 ! \
  qtimlpostprocess \
    module=yolov8 \
    labels=$LABELS \
    results=10 \
    settings='{"confidence": 40.0}' ! \
  fpsdisplaysink video-sink=fakesink text-overlay=false sync=false

This results in the following log after 10 seconds:

...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 25, dropped: 0, current: 3.11, average: 3.23
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 27, dropped: 0, current: 3.10, average: 3.22
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 29, dropped: 0, current: 3.11, average: 3.21
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 31, dropped: 0, current: 3.11, average: 3.20
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 33, dropped: 0, current: 3.11, average: 3.20

CPU usage reported by top:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
4986 ubuntu    20   0  784476 151296  96640 S  98.3   0.4   0:06.89 gst-launch-1.0

Information when adding a perf element before fpsdisplaysink:

perf: perf0; timestamp: 0:12:15.400971224; bps: 7372800.000; mean_bps: 8192000.000; fps: 3.157; mean_fps: 3.158; cpu: 15;

Information when using watch -n 1 cat /sys/class/kgsl/kgsl-3d0/gpubusy:

0       0

GPU baseline benchmark

Run this pipeline:

MODEL=/etc/models/yolox_quantized.tflite
LABELS=/etc/labels/yolox.json

gst-launch-1.0 -e -v \
  qtiqmmfsrc camera=1 ! \
  'video/x-raw,format=NV12,width=1280,height=720,framerate=30/1' ! \
  qtimlvconverter ! \
  qtimltflite \
    model=$MODEL \
    delegate=gpu ! \
  qtimlpostprocess \
    module=yolov8 \
    labels=$LABELS \
    results=10 \
    settings='{"confidence": 40.0}' ! \
  fpsdisplaysink video-sink=fakesink text-overlay=false sync=false

This results in the following log after 10 seconds:

...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 177, dropped: 0, current: 20.01, average: 20.43
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 188, dropped: 0, current: 20.41, average: 20.43
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 199, dropped: 0, current: 20.55, average: 20.44
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 210, dropped: 0, current: 20.63, average: 20.45
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 221, dropped: 0, current: 20.22, average: 20.44

CPU usage reported by top:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
6122 ubuntu    20   0 1000168 241008 163632 S  20.9   0.7   0:04.69 gst-launch-1.0

Information when adding a perf element before fpsdisplaysink:

perf: perf0; timestamp: 0:10:20.605190113; bps: 51609600.000; mean_bps: 49971200.000; fps: 20.358; mean_fps: 20.372; cpu: 13;

Information when using watch -n 1 cat /sys/class/kgsl/kgsl-3d0/gpubusy:

874231 1007485

Summary of results

Tested on: Ubuntu 24.04

Resolution | Backend | CPU use (%) | GPU use (%) | Frames rendered after 10 s | Average FPS
360p | HTP | 7.9 | 0.38 | 296 | 30.02
360p | CPU | 98.7 | 0 | 47 | 3.18
360p | GPU | 20.9 | 88.86 | 221 | 20.51
720p | HTP | 7.9 | 0.46 | 295 | 30.02
720p | CPU | 98.3 | 0 | 33 | 3.20
720p | GPU | 20.9 | 86.77 | 221 | 20.44
1080p | HTP | 7.6 | 0.49 | 298 | 30.02
1080p | CPU | 98.7 | 0 | 33 | 3.26
1080p | GPU | 21.3 | 87.66 | 209 | 20.39
GPU use (%) is calculated from the two values read from the /sys/class/kgsl/kgsl-3d0/gpubusy file with the formula: GPU Busy Raw / GPU Total Raw * 100
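
The formula can be applied directly to the two counters. The following is a minimal sketch using awk; the function name gpu_busy_percent is hypothetical, and the sample inputs are the gpubusy readings recorded in the benchmark sections of this page.

```shell
# Convert the two gpubusy counters ("busy total") into a percentage.
gpu_busy_percent() {
    awk '{
        busy = $1; total = $2
        # An idle GPU reports "0 0"; avoid dividing by zero in that case.
        if (total == 0) printf "%.2f\n", 0
        else            printf "%.2f\n", busy / total * 100
    }'
}

# Readings taken from the 720p benchmark runs on this page.
echo "4610 1000644"   | gpu_busy_percent   # HTP run, prints 0.46
echo "874231 1007485" | gpu_busy_percent   # GPU run, prints 86.77
echo "0       0"      | gpu_busy_percent   # CPU run, prints 0.00
```

On the board, the live value can be read with gpu_busy_percent < /sys/class/kgsl/kgsl-3d0/gpubusy.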
